

Munchausen Reinforcement Learning

Neural Information Processing Systems

Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transitioning state with their current estimate of this value. Yet another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution lies in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in this way provides an agent that is competitive with the state-of-the-art Rainbow on Atari games, without making use of distributional RL, n-step returns, or prioritized replay. To demonstrate the versatility of this idea, we also use it together with an Implicit Quantile Network (IQN). The resulting agent outperforms Rainbow on Atari, establishing a new state of the art with very few modifications to the original algorithm. To complement this empirical study, we provide strong theoretical insights into what happens under the hood -- implicit Kullback-Leibler regularization and an increase of the action-gap.
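The abstract's central idea, adding the scaled log-policy to the immediate reward, can be illustrated with a small numerical sketch of the one-step Munchausen-DQN target. This is an illustrative reconstruction, not the authors' implementation: the hyperparameter values (`tau`, `alpha`, `gamma`, the clipping floor `l0`) and function names are assumptions chosen for the example.

```python
import numpy as np

def softmax_policy(q, tau):
    """Softmax policy pi(a|s) proportional to exp(q(s,a)/tau), computed stably."""
    z = q / tau
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def munchausen_target(r, q_next, q_curr, a, tau=0.03, alpha=0.9,
                      gamma=0.99, l0=-1.0):
    """One-step Munchausen target for a single transition (s, a, r, s').

    r      : immediate reward
    q_next : Q-values at s' (from the target network), shape (n_actions,)
    q_curr : Q-values at s  (from the target network), shape (n_actions,)
    a      : index of the action taken at s
    """
    # Scaled log-policy bonus added to the reward -- the "Munchausen" term.
    # It is non-positive, and clipped from below at l0 to avoid numerical issues.
    log_pi_curr = tau * np.log(softmax_policy(q_curr, tau))
    munchausen_bonus = alpha * np.clip(log_pi_curr[a], l0, 0.0)

    # Soft (entropy-regularized) bootstrap over next-state actions.
    pi_next = softmax_policy(q_next, tau)
    log_pi_next = tau * np.log(pi_next)
    soft_value = (pi_next * (q_next - log_pi_next)).sum()

    return r + munchausen_bonus + gamma * soft_value
```

With `alpha = 0` the target reduces to a plain soft (entropy-regularized) backup; the Munchausen term only ever subtracts a bounded amount from the reward, since `tau * log pi(a|s)` is non-positive.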


Review for NeurIPS paper: Munchausen Reinforcement Learning

Neural Information Processing Systems

Additional Feedback: After Authors' Response: I still find the paper's analysis regarding action-gaps a bit weak, and the authors' response didn't help much in that regard. I think their action-gap analysis needs to be considered under the new findings of (van Seijen et al., 2019); increasing the action-gap is not important on its own; rather, it is the homogeneity of the action-gaps across states that matters. While I still stand by my verdict of accepting this paper, in light of the other reviews, I think the paper's writing should be toned down a bit regarding its theoretical novelty and its claims about empirical results (e.g., being the first non-distributional-RL agent to beat a distributional-RL one). Q1: To the best of my knowledge, IQN in Dopamine also uses Double Q-learning. Is this also the case for your M-IQN agent?


Review for NeurIPS paper: Munchausen Reinforcement Learning

Neural Information Processing Systems

In this submission, a new bootstrapping optimization technique is proposed, based on the idea of adding the log-policy to the immediate reward. This is shown to bring strong empirical gains, and the theoretical analysis helps to understand why. Although reviewers remained divided even after an active discussion period (7, 7, 5, 5), I believe this is a paper worth publishing at NeurIPS. Simple ideas bringing significant improvements, like this one, are typically the most impactful. I also appreciate the efforts made to better understand the theoretical properties of the proposed algorithm, beyond the basic intuition.

